Abstract. Many LOD datasets, such as DBpedia and LinkedGeoData,
are voluminous and process large amounts of requests from diverse
applications. Many data products and services rely on full or partial local
LOD replications to ensure faster querying and processing. While such
replicas enhance the flexibility of information sharing and integration
infrastructures, they also introduce data duplication with all the
associated undesirable consequences. Given the evolving nature of the original
and authoritative datasets, frequent replacements are required at great cost
to keep replicas consistent and up to date. In this paper, we
introduce an approach for interest-based RDF update propagation, which
propagates only interesting parts of updates from the source to the
target dataset. Effectively, this enables remote applications to ‘subscribe’
to relevant datasets and consistently reflect the necessary changes
locally without the need to frequently replace the entire dataset (or a
relevant subset). Our approach is based on a formal definition for
graph-pattern-based interest expressions that is used to filter interesting parts
of updates from the source. We implement the approach in the iRap
framework and perform a comprehensive evaluation based on DBpedia
Live updates to confirm the validity and value of our approach.

Keywords: Change Propagation, Dataset Dynamics, Linked Data, Replication 


1 Introduction 

In recent years, an increasing amount of structured data has been
published on the Web as Linked Open Data (LOD). Last year's assessment of
the size of the LOD cloud, for example, reported more than 1,000 published
datasets comprising almost 100 billion triples. LOD can be accessed through
SPARQL endpoints, Linked Data resource documents, or data dumps. Many of
these datasets, such as DBpedia and LinkedGeoData, are voluminous and
process large amounts of requests from diverse applications. Providing services on
top of these datasets is becoming a challenge due to the lack of service levels
regarding the availability of datasets and restrictions imposed by the publisher
on the type of query forms and number of results.

Replication of Linked Data datasets enhances the flexibility of information
sharing and integration infrastructures. Since hosting a replica of large datasets, such
as DBpedia and LinkedGeoData, is costly, organizations might want to host only
a relevant subset of the data, for example, using approaches such as RDFSlice [3].

Fig. 1: Changeset propagation approaches: right part - Interest-based replica
(iRap Replica); left part - Live mirror replica (Live Replica)


However, due to the evolving nature of these datasets in terms of content and 
ontology, maintaining a consistent and up-to-date replica of the relevant data is 
a major challenge. Resources in a dataset might be added, updated, or removed. 
The frequency of such changes depends on the type of data stored in a dataset. 
For example, sensor data or geolocation data from mobile devices changes more 
frequently than archival data. These changes should be dealt with by Linked 
Data consumption applications in order to keep local repositories consistent. 


Typically, a dataset mirror application propagates a changeset, published by
the source dataset, to a target dataset. For example, the DBpedia Live mirror
propagates all changesets to a target dataset, so that at time t the target
dataset contains the same triples as the source dataset. However, an
application interested in athletes, for example, uses only 268,773 out of 364,810,370 instances
of the English DBpedia 2014 dataset. An interest-based update propagation
could significantly reduce the amount of data to be shipped and managed at the
application side and thus lower the barrier for the deployment of Linked Data
applications. In this paper, we present an approach for interest-based update
propagation, which is based on the specification of data interests by a target
application. Based on such interest expressions, all updates are evaluated at the
source, and only those parts that are either directly interesting or could become
interesting in subsequent updates are shipped to the target application. We provide
a thorough formalization of our approach. Figure 1 shows that propagating
unfiltered data from Source to Target-2 (part b) syncs the complete
changeset regardless of whether the data is relevant or useful, whereas propagating filtered
data with iRap from Source to Target-1 (part a) transfers only relevant data.
Our evaluation shows that the data that has to be transferred and handled by
applications can be reduced by several orders of magnitude, thus substantially
lowering the re-usage barrier for Linked Data.


The article is structured as follows: Section 2 describes the formalization
of our framework in detail, and Sections 3 and 4 discuss the implementation
and evaluation of the iRap framework. Section 5 describes the related
work. Finally, Section 6 concludes and proposes directions for future work.

2 Formalization of Interest-based RDF Updates


Figure 2 illustrates the overall interest-based RDF update propagation
approach and summarizes the concepts defined through the formalization. Interest
evaluation takes place over the input sets of deleted ($D_{t_1-t_0}$) and added
($A_{t_1-t_0}$) triples from the source dataset ($\mathcal{V}_{t_1}$) in the time
interval $(t_0, t_1)$. Updates contain not only interesting and uninteresting parts
but also triples that can become interesting along with subsequent updates. We
therefore have to compute and store these sets of potentially interesting triples and take
them into account in subsequent update assessments.

For our formalization we will use the standard notations I, B, L and Var
for the disjoint sets of all IRIs, blank nodes, literals (typed and untyped) and
variables respectively. An RDF graph $\mathcal{V}$ is a finite set of RDF triples, i.e.,
$\mathcal{V} \subseteq (I \cup B) \times I \times (I \cup B \cup L)$. In this paper we use
the terms RDF graph, RDF dataset, and dataset interchangeably.


Definition 1 (Evolving Dataset). An evolving dataset $\mathcal{V}^g$ is a dataset
identified using the persistent IRI g whose content changes over time. $\mathcal{V}^g_t$
denotes a specific revision of $\mathcal{V}^g$ at a particular time t. For simplicity,
we will just refer to $\mathcal{V}_t$ instead of $\mathcal{V}^g_t$.

Definition 2 (BGP). A SPARQL basic graph pattern (BGP) expression is
defined recursively as follows:

1. a triple pattern $tp \in (I \cup B \cup Var) \times (I \cup Var) \times (I \cup L \cup Var)$ is a BGP

2. the expression (P1 AND P2) is a BGP, where P1 and P2 are themselves
BGPs

3. the expression (P FILTER E) is a BGP, where P is a BGP and E is a
SPARQL filter expression that evaluates to a boolean value.

Definition 3 (Non-disjoint BGP). A non-disjoint BGP is a BGP that
represents a connected graph.
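Definition 3 can be checked mechanically. The following is a minimal sketch in Python, not part of iRap: it assumes triple patterns are given as (subject, predicate, object) string tuples, treats subjects and objects as nodes and every pattern as an edge, and uses a small union-find to test connectivity. The function name `is_non_disjoint` is hypothetical. Note that under this simplification two patterns that share only a constant object also count as connected.

```python
def is_non_disjoint(bgp):
    """Return True if the triple patterns of `bgp` form one connected graph.

    Each pattern is a (subject, predicate, object) tuple of strings.
    Subjects and objects are treated as nodes, each pattern as an edge.
    """
    if not bgp:
        return False
    parent = {}

    def find(term):
        # Union-find lookup with path halving.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for s, _p, o in bgp:
        parent[find(s)] = find(o)  # merge the components of both end points

    # Connected means every pattern ends up in the same component.
    return len({find(s) for s, _p, _o in bgp}) == 1


athlete_bgp = [("?a", "a", "dbo:Athlete"), ("?a", "dbp:goals", "?goals")]
disjoint_bgp = athlete_bgp + [("?x", "foaf:name", "?name")]

print(is_non_disjoint(athlete_bgp))   # True
print(is_non_disjoint(disjoint_bgp))  # False
```

The second pattern set is rejected because `?x foaf:name ?name` shares no node with the athlete patterns.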


Fig. 2: Formalization overview of the interest-based RDF update propagation.

An optional graph pattern (OGP) is syntactically specified with the OPTIONAL 
keyword applied to a graph pattern. A set of triple patterns in a BGP must 
match for there to be a solution whereas triple patterns in OGP may extend the 
solution but their non-binding nature means that they cannot reject it. [T] 

Definition 4 (Partial Matches). A partial match is a set of triples that does
not fully match the BGP but matches at least one triple pattern in the BGP or OGP
of a query.

The triples added to, and removed from, an evolving dataset within a time-frame
are called the changeset of the dataset for that time-frame.

Definition 5 (Changeset). Let $\mathcal{V}_{t_1}$ be an evolving dataset at time $t_1$. A
changeset $\Delta(\mathcal{V}_{t_1})$, between $\mathcal{V}_{t_0}$ and $\mathcal{V}_{t_1}$,
where $t_0 < t_1$, is defined as:

$\Delta(\mathcal{V}_{t_1}) = \langle D_{t_1-t_0}, A_{t_1-t_0} \rangle$

where $D_{t_1-t_0}$ is the set of triples removed from $\mathcal{V}_{t_0}$ between time-points
$t_0$ and $t_1$, and $A_{t_1-t_0}$ is the set of triples added to $\mathcal{V}_{t_0}$ between
time-points $t_0$ and $t_1$.

Changesets can be computed using the difference between two versions of the
RDF dataset. The result of this computation gives the removed triples,
$D_{t_1-t_0} = \mathcal{V}_{t_0} \setminus \mathcal{V}_{t_1}$, and the added triples,
$A_{t_1-t_0} = \mathcal{V}_{t_1} \setminus \mathcal{V}_{t_0}$, between the given dataset
revisions $\mathcal{V}_{t_0}$ and $\mathcal{V}_{t_1}$. Datasets can be accompanied by a tool
that publishes changesets in real time, so that users can download these and synchronize
their local replicas. For instance, DBpedia publishes updates in a public changesets folder.
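The difference computation above maps directly onto set operations. The snippet below is a small illustrative sketch, assuming each revision fits in memory as a Python set of (subject, predicate, object) tuples; the function name `compute_changeset` and the sample triples are made up for the illustration.

```python
def compute_changeset(old_revision, new_revision):
    """Return (removed, added) between two revisions of a dataset."""
    removed = old_revision - new_revision  # D = V_t0 \ V_t1
    added = new_revision - old_revision    # A = V_t1 \ V_t0
    return removed, added


v_t0 = {
    ("dbr:Marcel", "a", "dbo:Athlete"),
    ("dbr:Marcel", "dbp:goals", "1"),
}
v_t1 = {
    ("dbr:Marcel", "a", "dbo:Athlete"),
    ("dbr:Arvid_Smit", "a", "dbo:Athlete"),
}

removed, added = compute_changeset(v_t0, v_t1)
print(removed)  # {('dbr:Marcel', 'dbp:goals', '1')}
print(added)    # {('dbr:Arvid_Smit', 'a', 'dbo:Athlete')}
```

Triples present in both revisions appear in neither set, which is why a changeset is usually much smaller than the dataset itself.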

Example 1. Let us assume the following two files are published by the
DBpedia Live extractor for the changes made on Feb 06, 2015 between 05:00
PM ($t_0$) and 05:02 PM ($t_1$):

Listing (1.1) File 000001.removed.nt

    dbr:Marcel             dbp:goals      1 .
    dbr:Marcel             dbo:team       dbr:FNFT .
    dbr:Tim%02             foaf:name      "Tim Berners-Lee" .
    dbr:Cristiano_Ronaldo  dbo:goals      96 .

Listing (1.2) File 000001.added.nt

    dbr:Cristiano_Ronaldo  dbo:goals      216 .
    dbr:Barack_Obama       foaf:name      "Barack Obama" .
    dbr:Barack_Obama       foaf:homepage  "http://www.barackobama.com/" .
    dbr:Rio_Ferdinand      a              foaf:Person .
    dbr:Rio_Ferdinand      a              dbo:Athlete .
    dbr:Rio_Ferdinand      dbp:goals      10 .
    dbr:Arvid_Smit         a              dbo:Athlete .


A changeset $\Delta(\mathcal{V}_{t_1})$ for the DBpedia Live dataset between $t_0$ and $t_1$ contains
$D_{05:02-05:00}$ = 000001.removed.nt and $A_{05:02-05:00}$ = 000001.added.nt. That is,
$\Delta(\mathcal{V}_{05:02}) = \langle$ 000001.removed.nt, 000001.added.nt $\rangle$

^ http://live.dbpedia.org/changesets/ 
prefixes can be checked in http://prefix.cc/ 





Definition 6 (Changeset Propagation). A changeset propagation is a
function $\nu$ that transforms a given dataset $\mathcal{V}_{t_0}$ to a new dataset
$\mathcal{V}_{t_1}$ by applying a changeset. That is:

$\nu(\mathcal{V}_{t_0}, \Delta(\mathcal{V}_{t_1})) = (\mathcal{V}_{t_0} \setminus D_{t_1-t_0}) \cup A_{t_1-t_0} = \mathcal{V}_{t_1}$

The changeset propagation function $\nu$, for example, deletes the triples in
000001.removed.nt from the target dataset and then inserts all triples from 000001.added.nt.
This order of operations (deletions first) ensures that inserted triples are not
removed again immediately.
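The effect of the operation order can be seen in a few lines of Python. This is an illustrative sketch only, with datasets modelled as in-memory sets of tuples; `propagate` and the sample triple are hypothetical names, not iRap's API.

```python
def propagate(dataset, changeset):
    """Apply a changeset: delete the removed triples first, then insert."""
    removed, added = changeset
    return (dataset - removed) | added


triple = ("dbr:Marcel", "a", "dbo:Athlete")
dataset = {triple}
# The same triple appears in both parts of the changeset.
changeset = ({triple}, {triple})

deleted_first = propagate(dataset, changeset)
inserted_first = (dataset | changeset[1]) - changeset[0]  # wrong order

print(deleted_first)   # {('dbr:Marcel', 'a', 'dbo:Athlete')}
print(inserted_first)  # set()
```

With deletions applied first the re-added triple survives; with the reversed order it is lost.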

If an organization maintaining a replica wants to host only a subset of the
original dataset, it needs to obtain only the relevant updates for this subset. For that
purpose, we specify interests to subscribe to ‘interesting’ changes only.
During interest registration, an organization provides information about the source
dataset to synchronize with, a target dataset endpoint that supports SPARQL
Update to propagate interesting changes, and an interest expression to select
relevant parts of a changeset. Below, we present a formal definition for interest
expressions over an evolving dataset.

Definition 7 (Interest Expression). An interest expression over an evolving
dataset, $\mathcal{V}^g$, is defined as:

$i_g = \langle \tau, b, op \rangle$

where g is an IRI identifying an evolving RDF dataset, $\tau$ is an IRI identifying
the target dataset endpoint, b is a non-disjoint BGP, and op is an optional graph
pattern (OGP) connected to b.

Example 2. An interest expression for a list of athletes with information about
goals scored, and optionally their homepage, is expressed as follows:

- g = http://live.dbpedia.org/changesets

- $\tau$ = http://localhost:3030/target/sparql

- b = { ?a a dbo:Athlete . ?a dbp:goals ?goals . }

- op = { ?a foaf:homepage ?page . }

The equivalent interest expression SPARQL query will be: 

SELECT * WHERE { ?a a dbo:Athlete . ?a dbp:goals ?goals . OPTIONAL { ?a foaf:homepage ?page . } } 
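The step from an interest expression to its query form is mechanical. The sketch below shows one way to do it in Python; the `InterestExpression` class and the two helper functions are hypothetical and only assemble query strings (prefix declarations are omitted, so the strings would need them before being sent to an endpoint). The CONSTRUCT variant arranges the same patterns so that the matching triples themselves are returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterestExpression:
    source: str          # g: IRI of the evolving source dataset
    target: str          # tau: IRI of the target dataset endpoint
    bgp: tuple           # triple patterns of the non-disjoint BGP
    ogp: tuple = ()      # triple patterns of the optional graph pattern


def _where(interest):
    body = " ".join(f"{tp} ." for tp in interest.bgp)
    if interest.ogp:
        optional = " ".join(f"{tp} ." for tp in interest.ogp)
        body += f" OPTIONAL {{ {optional} }}"
    return body


def to_select(interest):
    return f"SELECT * WHERE {{ {_where(interest)} }}"


def to_construct(interest):
    template = " ".join(f"{tp} ." for tp in interest.bgp + interest.ogp)
    return f"CONSTRUCT {{ {template} }} WHERE {{ {_where(interest)} }}"


athletes = InterestExpression(
    source="http://live.dbpedia.org/changesets",
    target="http://localhost:3030/target/sparql",
    bgp=("?a a dbo:Athlete", "?a dbp:goals ?goals"),
    ogp=("?a foaf:homepage ?page",),
)

print(to_select(athletes))
print(to_construct(athletes))
```

For the athletes interest, `to_select` reproduces the SELECT query given above.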

In order to initialize a local data store, i.e., the target dataset, SPARQL
CONSTRUCT queries can be used by employing the interest expression’s BGPs to
extract and load a subset of the source dataset. Then interest expressions are
registered with iRap to retrieve interesting updates from the source dataset.
iRap evaluates interest expressions over the changesets published along with
the source dataset. Without loss of generality, we assume interest
expressions here to be static for the lifetime of a target dataset, since an evolution
of interest expressions can be simulated by removal and addition. The result of
executing an interest evaluation for an interest expression against a changeset is
three sets of triples: 1. interesting, 2. potentially interesting, and 3. uninteresting
triples.

Definition 8 (Interesting Triples). Interesting triples are all triples
comprised in full matches of the BGP and possibly OGP of an interest expression,
$i_g$, against the sets of added or deleted triples of a changeset. Interesting triples
originating from the first element (i.e., removed triples ($D_{t_1-t_0}$)) of a
changeset, $\Delta(\mathcal{V}_{t_1})$, are called interesting-removed triples. Interesting triples
originating from the second element (i.e., added triples ($A_{t_1-t_0}$)) of a changeset,
$\Delta(\mathcal{V}_{t_1})$, are called interesting-added triples.

In addition to parts of a changeset for which the ‘interestingness’ can be
immediately decided, there might also be parts that are potentially interesting
since i) the missing parts to render them interesting are already contained in
the target knowledge base or ii) they will be propagated in subsequent updates.

Definition 9 (Potentially Interesting Triples). Potentially interesting triples
are triples comprised in partial matches of the BGP or in the OGP of an interest
expression, $i_g$:

- Potentially interesting triples originating from the first element (i.e.,
removed triples ($D_{t_1-t_0}$)) of a changeset, $\Delta(\mathcal{V}_{t_1})$, are called
potentially interesting-removed triples.

- Potentially interesting triples originating from the second element (i.e., added
triples ($A_{t_1-t_0}$)) of a changeset, $\Delta(\mathcal{V}_{t_1})$, are called
potentially interesting-added triples.

Potentially interesting triples can become interesting if triples missing in the 
changeset but required for a full BGP match are found in the target dataset 
or in subsequent changesets. Finally, there are triples in the changeset that are 
neither interesting nor potentially interesting. 

Definition 10 (Uninteresting Triples). Uninteresting triples are triples that
do not match any triple pattern in a BGP or OGP of any interest expression,
$i_g$, against the sets of added or deleted triples of a changeset.

Uninteresting triples are not interesting at the moment and can never become
interesting with subsequent changesets. iRap uses an interest query to select
candidate triples from a changeset and to assert them against a target dataset. These
candidates are retrieved in decreasing order of the number of matching BGP triple patterns of
interest expressions, followed by triples that match any part of the optional graph patterns.
The formal definition of interest candidate generation from a changeset is as follows:

Definition 11 (Interest Candidate Generation). An interest candidate
generation is the extraction of matching triples from a changeset for a non-disjoint
combination of triple patterns in the BGP of an interest expression, $i_g$. The result
of this extraction is an (n + 1)-tuple with decreasing order of matching:

$\pi(i_g, M) = \langle c_0, c_1, \ldots, c_{n-1}, c_{op} \rangle$

where:

- M is a set of removed (respectively added) triples in a changeset,

- n is the number of triple patterns in the BGP of interest expression, $i_g$,

- $c_k$ is a set of candidate triples in M that match $n - k$ ($0 \le k < n$) triple
patterns of the BGP (and optionally OGP) of the interest expression, $i_g$, and

- $c_{op}$ is a set of candidate triples in M that match at least one triple pattern
in the OGP of interest expression, $i_g$, but none of the triple patterns in the
BGP.

Example 3. An interest candidate generation for the interest expression $i_g$ from
Example 2 over the changeset from Example 1 gives the following result:

1. $\pi(i_g, D_{05:02-05:00}) = \langle c_0, c_1, c_{op} \rangle$ where:

   $c_0 = \emptyset$

   $c_1$ = dbr:Marcel dbp:goals 1 . dbr:Cristiano_Ronaldo dbo:goals 96 .

   $c_{op} = \emptyset$

2. $\pi(i_g, A_{05:02-05:00}) = \langle c_0, c_1, c_{op} \rangle$ where:

   $c_0$ = dbr:Rio_Ferdinand a dbo:Athlete . dbr:Rio_Ferdinand dbp:goals 10 .

   $c_1$ = dbr:Cristiano_Ronaldo dbp:goals 216 . dbr:Arvid_Smit a dbo:Athlete .

   $c_{op}$ = dbr:Barack_Obama foaf:homepage "http://www.barackobama.com" .
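The grouping described in Definition 11 can be sketched as follows. This is a simplified Python illustration, not iRap's implementation: it assumes a star-shaped interest in which all triple patterns share the subject variable, so triples can be grouped by subject, and it ignores repeated variables within one pattern. The function names are hypothetical, and the added triples are written as they appear in Example 3 (with dbp:goals for all goal counts).

```python
from collections import defaultdict


def is_var(term):
    return term.startswith("?")


def matches(triple, pattern):
    """A triple matches a pattern if every constant position is equal."""
    return all(is_var(p) or p == t for t, p in zip(triple, pattern))


def generate_candidates(triples, bgp, ogp=()):
    """Return ([c_0, ..., c_{n-1}], c_op) for a star-shaped interest."""
    n = len(bgp)
    by_subject = defaultdict(list)
    for t in triples:
        by_subject[t[0]].append(t)

    c = [set() for _ in range(n)]
    c_op = set()
    for group in by_subject.values():
        bgp_hits = {t for t in group if any(matches(t, tp) for tp in bgp)}
        opt_hits = {t for t in group if any(matches(t, tp) for tp in ogp)}
        # Number of distinct BGP patterns satisfied for this subject.
        matched = sum(1 for tp in bgp if any(matches(t, tp) for t in group))
        if matched:
            c[n - matched] |= bgp_hits | opt_hits   # c_k with k = n - matched
        else:
            c_op |= opt_hits                        # optional-only matches
    return c, c_op


BGP = [("?a", "a", "dbo:Athlete"), ("?a", "dbp:goals", "?goals")]
OGP = [("?a", "foaf:homepage", "?page")]
added = {
    ("dbr:Cristiano_Ronaldo", "dbp:goals", "216"),
    ("dbr:Barack_Obama", "foaf:name", "Barack Obama"),
    ("dbr:Barack_Obama", "foaf:homepage", "http://www.barackobama.com"),
    ("dbr:Rio_Ferdinand", "a", "foaf:Person"),
    ("dbr:Rio_Ferdinand", "a", "dbo:Athlete"),
    ("dbr:Rio_Ferdinand", "dbp:goals", "10"),
    ("dbr:Arvid_Smit", "a", "dbo:Athlete"),
}

candidates, optional_only = generate_candidates(added, BGP, OGP)
for k, c_k in enumerate(candidates):
    print(f"c_{k}:", sorted(c_k))
print("c_op:", sorted(optional_only))
```

Triples that match neither a BGP nor an OGP pattern, such as the foaf:name and foaf:Person triples, end up in no candidate set; they are the uninteresting triples of Definition 10.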

Now an interest candidate assertion verifies candidate triples with respect to 
all triple patterns in the BGP of an interest expression. 

Definition 12 (Interest Candidate Assertion). The candidate assertion
function extracts missing triples for the candidate, $c_i$ of $\pi(i_g, M)$, of an
interest expression $i_g$ from the target dataset, $\mathcal{T}_{t_0}$:

$\pi'(i_g, M) = \langle c'_{op}, c'_{n-1}, \ldots, c'_1, c'_0 \rangle$

where:

- M is a set of removed (respectively added) triples in a changeset,

- n is the number of triple patterns in the BGP of interest expression, $i_g$,

- $c'_{op}$ is a set of triples from target dataset, $\tau$, that matches the missing optional
graph patterns for candidate $c_0$ of $\pi(i_g, M)$,

- $c'_k$ is a set of triples from target dataset, $\tau$, that matches the missing triple
patterns for candidate $c_{n-k}$, where $0 < k < n$, of $\pi(i_g, M)$, and

- $c'_0$ is a set of triples from target dataset, $\tau$, that matches all triple patterns
in the BGP of the interest expression for candidate $c_{op}$ of $\pi(i_g, M)$.


Example 4. Let the target dataset, $\mathcal{T}_{t_0}$, at time $t_0$ contain the following triples:

    # Target dataset at time t0 = 05:00 PM Feb 06, 2015
    dbr:Marcel             a              dbo:Athlete .
    dbr:Marcel             dbp:goals      1 .
    dbr:Cristiano_Ronaldo  a              dbo:Athlete .
    dbr:Cristiano_Ronaldo  dbo:goals      96 .
    dbr:Cristiano_Ronaldo  foaf:homepage  "http://cristianoronaldo.com" .

An interest candidate assertion for the interest candidates generated in Example 3
yields the following result:

1. $\pi'(i_g, D_{05:02-05:00}) = \langle c'_{op}, c'_1, c'_0 \rangle$ where:

   $c'_{op} = \emptyset$

   $c'_1$ = dbr:Marcel a dbo:Athlete .
            dbr:Cristiano_Ronaldo a dbo:Athlete .
            dbr:Cristiano_Ronaldo foaf:homepage "http://cristianoronaldo.com" .

   $c'_0 = \emptyset$

2. $\pi'(i_g, A_{05:02-05:00}) = \langle c'_{op}, c'_1, c'_0 \rangle$ where:

   $c'_{op} = \emptyset$

   $c'_1$ = dbr:Cristiano_Ronaldo a dbo:Athlete .
            dbr:Cristiano_Ronaldo foaf:homepage "http://cristianoronaldo.com" .

   $c'_0 = \emptyset$
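The lookup in the target dataset can be illustrated with a short Python sketch. As before, this is a simplification rather than iRap's code: it assumes a star-shaped interest joined on the subject, holds the target dataset in memory as a set of tuples, and writes the goal counts with dbp:goals so that they match the interest expression. `assert_candidates` is a hypothetical name.

```python
def is_var(term):
    return term.startswith("?")


def matches(triple, pattern):
    return all(is_var(p) or p == t for t, p in zip(triple, pattern))


def assert_candidates(candidates, target, bgp, ogp=()):
    """Fetch from `target` the triples that complete partial matches."""
    found = set()
    for subject in {t[0] for t in candidates}:
        own = [t for t in candidates if t[0] == subject]
        # BGP patterns that the candidate triples of this subject do not cover.
        missing = [tp for tp in bgp if not any(matches(t, tp) for t in own)]
        for t in target:
            if t[0] != subject or t in candidates:
                continue
            if any(matches(t, tp) for tp in missing) or any(
                matches(t, tp) for tp in ogp
            ):
                found.add(t)
    return found


BGP = [("?a", "a", "dbo:Athlete"), ("?a", "dbp:goals", "?goals")]
OGP = [("?a", "foaf:homepage", "?page")]
RON = "dbr:Cristiano_Ronaldo"
target = {
    ("dbr:Marcel", "a", "dbo:Athlete"),
    ("dbr:Marcel", "dbp:goals", "1"),
    (RON, "a", "dbo:Athlete"),
    (RON, "dbp:goals", "96"),
    (RON, "foaf:homepage", "http://cristianoronaldo.com"),
}

removed_c1 = {("dbr:Marcel", "dbp:goals", "1"), (RON, "dbp:goals", "96")}
added_c1 = {(RON, "dbp:goals", "216"), ("dbr:Arvid_Smit", "a", "dbo:Athlete")}

print(sorted(assert_candidates(removed_c1, target, BGP, OGP)))
print(sorted(assert_candidates(added_c1, target, BGP, OGP)))
```

For the removed candidates the lookup returns the type triples of both athletes plus the homepage triple; for the added candidates it returns only the two triples about dbr:Cristiano_Ronaldo, since the target holds nothing about dbr:Arvid_Smit.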


The interest evaluation over a changeset $\Delta(\mathcal{V}_{t_1})$ is performed in two steps.
First, interest expressions are evaluated against the removed triples of a
changeset as $\delta(i_g, D_{t_1-t_0})$, see Definition 13. Second, interest expressions are
evaluated against the added triples of a changeset as $\alpha(i_g, A_{t_1-t_0})$, see
Definition 14. During interest evaluation, added triples are combined with potentially
interesting triples from previous changesets (i.e., $I_{t_1-t_0} = A_{t_1-t_0} \cup \rho_{t_0}$)
to check their potential promotion to interesting triples.

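Definition 13 (Interest Evaluation over Deleted Triples). Interest
evaluation over deleted triples is a function, $\delta(i_g, D_{t_1-t_0})$, that returns a 3-element
tuple:

$\delta(i_g, D_{t_1-t_0}) = \pi(i_g, D_{t_1-t_0}) \cup^* \pi'(i_g, D_{t_1-t_0}) = \langle r_{t_1-t_0}, r_{i(t_1-t_0)}, r'_{t_1-t_0} \rangle$

where:

- $\pi(i_g, D_{t_1-t_0})$ is an interest candidate generation against deleted triples,

- $\pi'(i_g, D_{t_1-t_0})$ is an interest candidate assertion against deleted triples,

- $r_{t_1-t_0} = \{c_0 \cup c_k \cup c_{op} \mid c_0, c_k, c_{op} \in \pi(i_g, D_{t_1-t_0}) \text{ and } \exists c'_{n-k}, c'_0 \in \pi'(i_g, D_{t_1-t_0})\}$
is the set of interesting removed triples, i.e., no longer interesting,

- $r_{i(t_1-t_0)} = \{c_k \cup c_{op} \mid c_k, c_{op} \in \pi(i_g, D_{t_1-t_0}) \text{ and } \nexists c'_{n-k}, c'_0 \in \pi'(i_g, D_{t_1-t_0})\}$
is the set of potentially interesting removed triples (existing only in removed
triples of a changeset), and

- $r'_{t_1-t_0} = \{c'_0 \cup c'_k \cup c'_{op} \mid c'_0, c'_k, c'_{op} \in \pi'(i_g, D_{t_1-t_0}) \text{ and } \exists c_{op}, c_{n-k}, c_0 \in \pi(i_g, D_{t_1-t_0})\}$
is the set of triples that become potentially interesting after removing $r_{t_1-t_0}$.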


Example 5. An interest evaluation over deleted triples in our running example
(using the results of Example 3 and Example 4 respectively) is as follows:

$\delta(i_g, D_{05:02-05:00}) = \pi(i_g, D_{05:02-05:00}) \cup^* \pi'(i_g, D_{05:02-05:00}) = \langle r_{05:02-05:00}, r_{i(05:02-05:00)}, r'_{05:02-05:00} \rangle$

1. $r_{05:02-05:00} = c_1$ (in Example 3):

    dbr:Marcel             dbp:goals  1 .
    dbr:Cristiano_Ronaldo  dbo:goals  96 .

2. $r_{i(05:02-05:00)} = \emptyset$ (since all the potentially interesting removed triples of $c_1$
in Example 3 become interesting and there are no other triples in $c_{op}$)

3. $r'_{05:02-05:00} = c'_1$:

    dbr:Marcel             a              dbo:Athlete .
    dbr:Cristiano_Ronaldo  a              dbo:Athlete .
    dbr:Cristiano_Ronaldo  foaf:homepage  "http://cristianoronaldo.com" .

Footnote: $\cup^*$ indicates that after the component-wise union of the two sets the results are
combined into the three categories of the resulting 3-tuple, namely, (i) elements from the left
that have matching right elements, (ii) elements from the left that do not have matching
right elements, and (iii) elements from the right that have a match on the left.

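Definition 14 (Interest Evaluation over Added Triples). Interest
evaluation over added triples is a function, $\alpha(i_g, A_{t_1-t_0})$, that returns a 3-element tuple:

$\alpha(i_g, A_{t_1-t_0}) = \pi(i_g, I_{t_1-t_0}) \cup^* \pi'(i_g, I_{t_1-t_0}) = \langle a_{t_1-t_0}, a_{i(t_1-t_0)}, a'_{t_1-t_0} \rangle$

where:

- $I_{t_1-t_0} = A_{t_1-t_0} \cup \rho_{t_0}$ is the union of the added triples and the potentially
interesting dataset,

- $\pi(i_g, I_{t_1-t_0})$ is an interest candidate generation over $I_{t_1-t_0}$,

- $\pi'(i_g, I_{t_1-t_0})$ is an interest candidate assertion over $I_{t_1-t_0}$,

- $a_{t_1-t_0} = \{c_0 \cup c_k \cup c_{op} \mid c_0, c_k, c_{op} \in \pi(i_g, I_{t_1-t_0}) \text{ and } \exists c'_{n-k}, c'_0 \in \pi'(i_g, I_{t_1-t_0})\}$
is the set of interesting added triples,

- $a_{i(t_1-t_0)} = \{c_k \cup c_{op} \mid c_k, c_{op} \in \pi(i_g, I_{t_1-t_0}) \text{ and } \nexists c'_{n-k}, c'_0 \in \pi'(i_g, I_{t_1-t_0})\}$
is the set of potentially interesting added triples that do not have related triples
in the target dataset, and

- $a'_{t_1-t_0} = \{c'_0 \cup c'_k \cup c'_{op} \mid c'_0, c'_k, c'_{op} \in \pi'(i_g, I_{t_1-t_0}) \text{ and } \exists c_{op}, c_{n-k}, c_0 \in \pi(i_g, I_{t_1-t_0}) \text{ respectively}\}$
is the set of triples from the target dataset that are related to $a_{i(t_1-t_0)}$.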


Example 6. An interest evaluation over added triples in our running example
(using the results of Example 3 and Example 4 respectively) is as follows:

$\alpha(i_g, A_{05:02-05:00}) = \pi(i_g, I_{05:02-05:00}) \cup^* \pi'(i_g, I_{05:02-05:00}) = \langle a_{05:02-05:00}, a_{i(05:02-05:00)}, a'_{05:02-05:00} \rangle$

1. $a_{05:02-05:00} = c_1 \cup c'_1 \cup c_0$:

    dbr:Cristiano_Ronaldo  dbo:goals      216 .
    dbr:Cristiano_Ronaldo  a              dbo:Athlete .
    dbr:Cristiano_Ronaldo  foaf:homepage  "http://cristianoronaldo.com" .
    dbr:Rio_Ferdinand      a              dbo:Athlete .
    dbr:Rio_Ferdinand      dbp:goals      10 .

2. $a_{i(05:02-05:00)}$:

    dbr:Arvid_Smit         a              dbo:Athlete .
    dbr:Barack_Obama       foaf:homepage  "http://www.barackobama.com" .

3. $a'_{05:02-05:00} = \emptyset$

Now, we will use the results from Definition 13 and Definition 14 to compute
interesting and potentially interesting changesets.


Definition 15 (Interest Evaluation). An interest evaluation over a
changeset $\Delta(\mathcal{V}_{t_1})$ at time $t_1$ is a function $\epsilon(i_g, \Delta(\mathcal{V}_{t_1}))$
that combines the results from an interest evaluation over deleted triples,
$\delta(i_g, D_{t_1-t_0})$, and an interest evaluation over added triples,
$\alpha(i_g, I_{t_1-t_0})$, to return an interesting changeset and a
potentially interesting changeset as follows:

$\epsilon(i_g, \Delta(\mathcal{V}_{t_1})) = \delta(i_g, D_{t_1-t_0}) \times \alpha(i_g, I_{t_1-t_0}) = \langle \Delta(\mathcal{T}_{t_1}), \Delta(\rho_{t_1}) \rangle$

where $i_g$ is an interest expression over an evolving dataset, $\Delta(\mathcal{T}_{t_1})$ is an
interesting changeset (see Definition 16), and $\Delta(\rho_{t_1})$ is a potentially interesting
changeset (see Definition 17).

Definition 16 (Interesting Changeset). Let $\mathcal{T}_{t_0}$ be a target dataset at time
$t_0$. An interesting changeset, $\Delta(\mathcal{T}_{t_1})$, for $\mathcal{T}_{t_0}$ at time $t_1$
is defined as:

$\Delta(\mathcal{T}_{t_1}) = \langle [r_{t_1-t_0} \cup r'_{t_1-t_0}], a_{t_1-t_0} \rangle$

where:

- $r_{t_1-t_0}$ is the set of interesting removed triples, interesting removed optional
triples and potentially interesting removed triples with a match found in the target
dataset during candidate generation, $\pi(i_g, D_{t_1-t_0})$,

- $r'_{t_1-t_0}$ is the set of triples from the target dataset that are related to potentially
interesting removed triples computed by $\pi'(i_g, D_{t_1-t_0})$, and

- $a_{t_1-t_0}$ is the set of interesting added triples, interesting optional triples and
potentially interesting added triples with a match found in the target dataset during
candidate generation, $\pi(i_g, A_{t_1-t_0})$.

Example 7. An interesting changeset for our running example is as follows:

$\Delta(\mathcal{T}_{05:02}) = \langle [r_{05:02-05:00} \cup r'_{05:02-05:00}], a_{05:02-05:00} \rangle$

1. interesting removed triples - $[r_{05:02-05:00} \cup r'_{05:02-05:00}]$:

    dbr:Marcel             a              dbo:Athlete .
    dbr:Marcel             dbp:goals      1 .
    dbr:Cristiano_Ronaldo  dbo:goals      96 .
    dbr:Cristiano_Ronaldo  a              dbo:Athlete .
    dbr:Cristiano_Ronaldo  foaf:homepage  "http://cristianoronaldo.com" .

2. interesting added triples - $a_{05:02-05:00}$:

    dbr:Cristiano_Ronaldo  dbo:goals      216 .
    dbr:Cristiano_Ronaldo  a              dbo:Athlete .
    dbr:Cristiano_Ronaldo  foaf:homepage  "http://cristianoronaldo.com" .
    dbr:Rio_Ferdinand      a              dbo:Athlete .
    dbr:Rio_Ferdinand      dbp:goals      10 .


Definition 17 (Potentially Interesting Changeset). Let $\rho_{t_0}$ be a potentially
interesting dataset for interest expression $i_g$ at time $t_0$. A changeset, $\Delta(\rho_{t_1})$, for
$\rho_{t_0}$ at time $t_1$ is defined as:

$\Delta(\rho_{t_1}) = \langle r_{i(t_1-t_0)}, [a_{i(t_1-t_0)} \cup r'_{t_1-t_0}] \rangle$

where:

- $r_{i(t_1-t_0)}$ is a set of potentially interesting removed triples,

- $a_{i(t_1-t_0)}$ is a set of potentially interesting added triples computed on the added
triples of a changeset and related triples extracted from the target while removing
potentially interesting removed triples, and

- $r'_{t_1-t_0}$ is the set of triples from the target dataset that are related to potentially
interesting removed triples computed by $\pi'(i_g, D_{t_1-t_0})$.


Example 8. The potentially interesting changeset for our running example is as
follows: $\Delta(\rho_{05:02}) = \langle r_{i(05:02-05:00)}, [a_{i(05:02-05:00)} \cup r'_{05:02-05:00}] \rangle$

1. Potentially interesting removed triples - $r_{i(05:02-05:00)} = \emptyset$

2. Potentially interesting added triples - $[a_{i(05:02-05:00)} \cup r'_{05:02-05:00}]$:

    dbr:Arvid_Smit    a              dbo:Athlete .
    dbr:Barack_Obama  foaf:homepage  "http://www.barackobama.com" .
    dbr:Marcel        a              dbo:Athlete .

Note: triples in $r'_{05:02-05:00}$ that are added back to the target dataset (here, those
about dbr:Cristiano_Ronaldo) are no longer stored in the potentially interesting dataset.


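Definition 18 (Interesting Update Propagation). An interesting changeset
propagation is an update operation that transforms the target dataset $\mathcal{T}_{t_0}$ to
the new dataset $\mathcal{T}_{t_1}$ and $\rho_{t_0}$ to the new dataset $\rho_{t_1}$ by applying
the result of the interest evaluation, $\epsilon(i_g, \Delta(\mathcal{V}_{t_1}))$. That is, the
propagation yields:

$\nu(\mathcal{T}_{t_0}, \Delta(\mathcal{T}_{t_1})) \wedge \nu(\rho_{t_0}, \Delta(\rho_{t_1})) = \mathcal{T}_{t_1} \wedge \rho_{t_1}$

where:

- $\Delta(\mathcal{V}_{t_1})$ is a changeset at time $t_1$,

- $\nu(\mathcal{T}_{t_0}, \Delta(\mathcal{T}_{t_1})) = [\mathcal{T}_{t_0} \setminus (r_{t_1-t_0} \cup r'_{t_1-t_0})] \cup a_{t_1-t_0}$
is the changeset propagation of the interesting changeset, and

- $\nu(\rho_{t_0}, \Delta(\rho_{t_1})) = [\rho_{t_0} \setminus r_{i(t_1-t_0)}] \cup [a_{i(t_1-t_0)} \cup r'_{t_1-t_0}]$
is the changeset propagation of the potentially interesting changeset.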
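The two propagations of Definition 18 can be replayed on the running example with plain set operations. The following Python sketch is illustrative only: datasets are in-memory sets of tuples, `propagate_interest` is a hypothetical helper, the target dataset is the one from Example 4, and the two changesets are copied from Examples 7 and 8.

```python
def propagate_interest(target, rho, interesting, potentially):
    """Apply an interesting changeset to `target` and a potentially
    interesting changeset to `rho`; each changeset is (removed, added)."""
    t_removed, t_added = interesting
    p_removed, p_added = potentially
    new_target = (target - t_removed) | t_added   # deletions first
    new_rho = (rho - p_removed) | p_added
    return new_target, new_rho


RON, RIO, MAR = "dbr:Cristiano_Ronaldo", "dbr:Rio_Ferdinand", "dbr:Marcel"
HOME = (RON, "foaf:homepage", "http://cristianoronaldo.com")

target_t0 = {
    (MAR, "a", "dbo:Athlete"), (MAR, "dbp:goals", "1"),
    (RON, "a", "dbo:Athlete"), (RON, "dbo:goals", "96"), HOME,
}
rho_t0 = set()

# Interesting changeset (Example 7): removed = r united with r', added = a.
interesting = (
    set(target_t0),
    {(RON, "dbo:goals", "216"), (RON, "a", "dbo:Athlete"), HOME,
     (RIO, "a", "dbo:Athlete"), (RIO, "dbp:goals", "10")},
)
# Potentially interesting changeset (Example 8).
potentially = (
    set(),
    {("dbr:Arvid_Smit", "a", "dbo:Athlete"),
     ("dbr:Barack_Obama", "foaf:homepage", "http://www.barackobama.com"),
     (MAR, "a", "dbo:Athlete")},
)

target_t1, rho_t1 = propagate_interest(target_t0, rho_t0, interesting, potentially)
print(len(target_t1), len(rho_t1))  # 5 3
```

After the update the target holds five triples about the two athletes with complete matches, and the potentially interesting dataset holds the three triples that may still be completed by later changesets.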


Example 9. Propagation of the interesting changeset of Example 7 to the target
dataset, $\mathcal{T}_{t_0}$, and of the potentially interesting changeset of Example 8 to the
potentially interesting dataset, $\rho_{t_0}$, transforms the datasets to:

Listing (1.3) Resulting target dataset

    dbr:Cristiano_Ronaldo  dbo:goals      216 .
    dbr:Cristiano_Ronaldo  a              dbo:Athlete .
    dbr:Cristiano_Ronaldo  foaf:homepage  "http://cristianoronaldo.com" .
    dbr:Rio_Ferdinand      a              dbo:Athlete .
    dbr:Rio_Ferdinand      dbp:goals      10 .

Listing (1.4) Potentially interesting dataset after change propagation

    dbr:Arvid_Smit         a              dbo:Athlete .
    dbr:Barack_Obama       foaf:homepage  "http://www.barackobama.com" .
    dbr:Marcel             a              dbo:Athlete .









3 iRap RDF Update Propagation Framework 

In this section we describe the architecture of our interest-based update
propagation framework iRap and its implementation. iRap was implemented in Java
using Jena-ARQ. It is available as open source and consists of three modules:
(1) Interest Manager (IM), (2) Changeset Manager (CM) and (3) Interest
Evaluator (IE), each of which can be extended to accommodate new or improved
functionality.

Changeset evaluation starts after a user registers an interest expression
using the IM service, as shown in Figure 3. The CM module fetches a list of
changeset folders from the interest expressions and checks for new changesets
at a configurable interval. After downloading and decompressing new changesets, the
CM notifies the IE, which then imports the list of interest expressions registered
for this particular changeset through the IM and initiates the evaluation.
The resulting interesting triples are propagated to the target dataset, whereas potentially
interesting triples are stored in the potentially interesting dataset ($\rho$). After all
interest expressions have been evaluated over the changeset, the IE notifies the
CM to clean up the downloaded files.



Fig. 3: Architecture of the iRap interest-based RDF update propagation
framework.


4 Evaluation 

To evaluate the proposed approach, we performed experiments on the iRap
framework using changesets published by DBpedia and compared the results
with the DBpedia Live Mirror tool. The comparison considers two cases:
using iRap to update a previously established local replica of i) an entire remote
dataset and ii) a subset of a remote dataset. These two cases simulate two ways in
which iRap can be used: i) using interest-based changeset propagation for future
updates of a local copy of a large dataset or ii) starting with a new subset of the
large dataset.

Date              Oct 01   Oct 02   Oct 03   Oct 04-12   Oct 13   Oct 14   Oct 15
Total Changesets       0    1,621    1,755           0    5,352      751    2,578

Table 1: Distribution of DBpedia Live changesets published October 01-15, 2014.

Listing (1.5) Location interest query

    CONSTRUCT WHERE {
      ?location a ?type .
      ?location wgs:long ?long .
      ?location wgs:lat ?lat .
      ?location rdfs:label ?label .
      ?location dbo:abstract ?abstract .
      OPTIONAL { ?location dcterms:subject ?subject }
    }

Listing (1.6) Football interest query

    CONSTRUCT WHERE {
      ?footballer a dbo:SoccerPlayer .
      ?footballer foaf:name ?name .
      ?footballer dbo:team ?team .
      ?team rdfs:label ?teamName .
    }


Experimental Setting. In order to test our approach we used the DBpedia
dump of September 30, 2014 for the initial setup of the target datasets for two
different application domains, namely, the Location and Football datasets.
Changesets published between October 01 and October 15, 2014 (see Table 1) were used
for evaluation. Initially we set up two TDB datasets for each target dataset
from the DBpedia dump. We loaded all triples from the dump to the Location
dataset, whereas for the Football dataset we only loaded the slice corresponding to
the interesting triples matching Listing 1.6.

Initially, the Location dataset contains all triples from DBpedia yielding a 
total of 364,810,370 triples, whereas the Football dataset contains only 265,622 
triples. A total of 12,057 changesets (pairs of removed and added .nt.gz files) 
have been published in the evaluation timeframe. 

The evaluation comprises two interest expressions, $I_1$ and $I_2$. $I_1$ comprises
a non-disjoint BGP containing 4 triple patterns with a maximum of two
variables per triple pattern (object-subject join) (Listing 1.6). $I_2$ comprises a
non-disjoint BGP containing 5 triple patterns with a maximum of two variables per
triple pattern (subject-subject joins) and an OGP containing one triple
pattern (Listing 1.5).

We set up two target datasets and a potentially interesting dataset using Jena
TDB and jena-fuseki for each dataset. The potentially interesting dataset stores
potentially interesting triples for each interest expression within a named graph.
All experiments were carried out on a 64-bit machine with Windows 7, Intel(R)
Core i7-4770 CPU, 16GB RAM and 1TB HD.


Evaluation Results and Discussion [Figure 4] summarizes our experimental 
results for two target datasets shows the growth of the potentially interesting 
dataset. Results of the interest evaluation for the Football dataset are presented 
in Table 2 From the overall changesets considered for this evaluation, in Table 1[ 


® http://live.dbpedia.org/dumps/dbpedia_2014_09_30_00_00.fixed.ttl.gz 
® http://live.dbpedia.org/changesets/2014/10/ 


























Day 

Total 

Removed 

Interesting 

Removed 

Total 

Added 

Interesting 

Added 

Potentially 

Interesting 

Elapsed 
(in minutes) 

1 

1,895,179 

9,065 

2,051,976 

184 

169,554 

15.18 

2 

1,748,511 

4,865 

2,384,232 

155 

168,856 

20.85 

3 

1,716 

0 

10,728,855 

45,429 

684,491 

69.86 

4 

449 

0 

1,522,939 

7,970 

97,300 

10.17 

5 

1,677 

0 

5,234,788 

19,598 

333,232 

60.06 


Table 2: Comparison of results for Football App 


Day  Total Removed  Interesting Removed  Total Added  Interesting Added  Potentially Interesting  Elapsed (in minutes)
1        1,895,179               77,377    2,051,976              7,093                  430,376                166.59
2        1,748,511               82,461    2,384,232              7,301                  509,972                242.62
3            1,716                    0   10,728,855            259,587                2,002,271                417.87
4              449                    0    1,522,939             27,292                  280,718                 64.41
5            1,677                    0    5,234,788            100,073                  972,284                176.78

Table 3: Comparison of results for Location App 


only 0.38% of the removed and 0.335% of the added triples were identified as 
interesting for the Football dataset. The average changeset publication interval 
was 18.81s and the average time required for a changeset evaluation was 0.87s. 
This shows that iRap completes the propagation of a changeset well before the 
next changeset is published. 

Results of the interest evaluation for the Location dataset are shown in 
Table 3. From the overall changesets considered for this evaluation (Table 1), 
only 4.38% of the removed and 1.81% of the added triples were interesting for 
the Location dataset. The average time spent on a changeset evaluation was 
5.31s. The interest evaluation for the Location dataset takes longer than for 
the Football dataset because its target dataset is the full DBpedia. 
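The percentages and per-changeset times quoted above can be re-derived from the per-day rows of Tables 2 and 3. The sketch below does so under two assumptions that the text does not state explicitly: the "Elapsed" column is the total evaluation time per day, and the five days together cover all 12,057 changesets. With the rows as transcribed here it yields roughly 0.38% / 0.33% and 0.88s for Football, and 4.38% / 1.83% and 5.32s for Location, so small rounding differences from the quoted figures are to be expected.

```python
# Per-day rows transcribed from Tables 2 and 3:
# (total removed, interesting removed, total added, interesting added, elapsed minutes)
FOOTBALL = [
    (1_895_179, 9_065, 2_051_976, 184, 15.18),
    (1_748_511, 4_865, 2_384_232, 155, 20.85),
    (1_716, 0, 10_728_855, 45_429, 69.86),
    (449, 0, 1_522_939, 7_970, 10.17),
    (1_677, 0, 5_234_788, 19_598, 60.06),
]
LOCATION = [
    (1_895_179, 77_377, 2_051_976, 7_093, 166.59),
    (1_748_511, 82_461, 2_384_232, 7_301, 242.62),
    (1_716, 0, 10_728_855, 259_587, 417.87),
    (449, 0, 1_522_939, 27_292, 64.41),
    (1_677, 0, 5_234_788, 100_073, 176.78),
]
CHANGESETS = 12_057  # changesets published in the evaluation timeframe


def summarize(rows, changesets=CHANGESETS):
    """Share of interesting triples (in percent) and mean seconds per changeset."""
    total_removed = sum(r[0] for r in rows)
    interesting_removed = sum(r[1] for r in rows)
    total_added = sum(r[2] for r in rows)
    interesting_added = sum(r[3] for r in rows)
    minutes = sum(r[4] for r in rows)
    return {
        "removed_pct": 100 * interesting_removed / total_removed,
        "added_pct": 100 * interesting_added / total_added,
        # Assumption: elapsed minutes cover exactly `changesets` evaluations.
        "seconds_per_changeset": 60 * minutes / changesets,
    }


for name, rows in (("Football", FOOTBALL), ("Location", LOCATION)):
    print(name, {k: round(v, 3) for k, v in summarize(rows).items()})
```

Both per-changeset times stay well below the reported average publication interval of 18.81s, which is the comparison the argument rests on.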

Figure 4a shows the number of triples published per day and the number of 
interesting and potentially interesting triples found by the interest 
evaluation for the Football dataset. Figure 4b shows the dataset growth 
comparison between iRap and a full mirror approach. As the figure clearly 
shows, iRap-managed datasets are almost two orders of magnitude smaller and 
grow much more slowly than with a mirror approach. Note that the growth of 
each dataset is calculated by subtracting the number of removed triples from, 
and adding the number of added triples to, the total number of triples in the 
dataset. 
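The growth calculation described above amounts to a running sum, sketched here with the per-day numbers from Table 2 and the initial sizes reported for the two datasets. Two assumptions are made for illustration: a full mirror starts from the complete DBpedia dump (364,810,370 triples) and applies every removed and added triple, while the iRap-managed Football dataset applies only the interesting ones. The function and variable names are hypothetical.

```python
from itertools import accumulate


def dataset_sizes(initial, daily_changes):
    """Dataset size after each day: previous size - removed + added."""
    deltas = (added - removed for removed, added in daily_changes)
    # accumulate(..., initial=...) yields the start value first; drop it.
    return list(accumulate(deltas, initial=initial))[1:]


# (removed, added) per day, transcribed from Table 2.
FOOTBALL_INTERESTING = [
    (9_065, 184), (4_865, 155), (0, 45_429), (0, 7_970), (0, 19_598),
]
ALL_CHANGES = [
    (1_895_179, 2_051_976), (1_748_511, 2_384_232), (1_716, 10_728_855),
    (449, 1_522_939), (1_677, 5_234_788),
]

irap_football = dataset_sizes(265_622, FOOTBALL_INTERESTING)
full_mirror = dataset_sizes(364_810_370, ALL_CHANGES)  # assumed mirror baseline

for day, (small, large) in enumerate(zip(irap_football, full_mirror), start=1):
    print(f"day {day}: iRap {small:>11,}  mirror {large:>13,}")
```

The sketch treats every listed removal and addition as effective, that is, it ignores removals of triples that are absent and additions of triples that are already present.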

Figure 4e shows a substantial growth of the potentially interesting dataset 
for both the Location and the Football dataset. This is due to the number of 
variables used in the triple patterns, and to the number and type of triple 
patterns in the interest expression. For example, the Football dataset interest 
query contains the common predicates foaf:name and rdfs:label, which are used 
by almost all resources and thus result in many potentially interesting 
triples. Exploring further options to reduce the growth of the potentially 
interesting dataset is thus an interesting direction for future work. Again, 
the average processing time per changeset is always well below the average 
time between two changesets. The correctness of the triples resulting from the 
first changesets for the Football dataset interest expression was checked by 
manual inspection. 
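Why common predicates inflate the potentially interesting dataset can be seen in a deliberately simplified sketch. It is not iRap's evaluation algorithm: it assumes a star-shaped BGP in which all triple patterns share one subject variable, ignores the contents of the target dataset, and treats a triple that matches some pattern without its subject satisfying the whole BGP as a candidate. The patterns, prefixed names, and sample triples are invented for illustration.

```python
def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def matches(pattern, triple):
    """True if every constant position of the pattern equals the triple's term."""
    return all(is_var(p) or p == t for p, t in zip(pattern, triple))


def classify(triples, bgp):
    """Split triples into (interesting, potentially_interesting).

    Assumes a star-shaped BGP: all patterns share the same subject variable,
    so grouping the matches by subject is enough to test the join.
    """
    hits = {}  # subject -> list of (triple, index of the matched pattern)
    for triple in triples:
        for i, pattern in enumerate(bgp):
            if matches(pattern, triple):
                hits.setdefault(triple[0], []).append((triple, i))
    interesting, potential = [], []
    for matched in hits.values():
        covered = {i for _, i in matched}
        target = interesting if len(covered) == len(bgp) else potential
        # dict.fromkeys removes duplicates while keeping the original order.
        target.extend(dict.fromkeys(t for t, _ in matched))
    return interesting, potential


# Hypothetical interest expression, not the one from the paper's listing.
BGP = [
    ("?player", "rdf:type", "dbo:SoccerPlayer"),
    ("?player", "foaf:name", "?name"),
]

# Hypothetical added triples of one changeset.
added = [
    ("dbr:Some_Player", "rdf:type", "dbo:SoccerPlayer"),
    ("dbr:Some_Player", "foaf:name", '"Some Player"'),
    ("dbr:Some_City", "foaf:name", '"Some City"'),
    ("dbr:Some_City", "dbo:populationTotal", '"1000"'),
]

interesting, potential = classify(added, BGP)
print("interesting:", interesting)
print("potentially interesting:", potential)
```

In this toy changeset the city's foaf:name triple has nothing to do with football, yet it matches one pattern and is therefore retained as a candidate; a predicate that occurs on almost every resource produces such candidates at scale.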
Fig. 4: Evaluation results. Panels: (a) Football dataset changes per day; 
(e) Potentially interesting dataset growth 


5 Related Work 

Most related work on dataset change detection and propagation focuses on 
distributed publish/subscribe systems [?], resource link maintenance [?], 
target synchronization [?], partial replicas [?], data-shipping [?], lazy 
updates [2], and real-time update notification [?]. In [3], the authors propose 
a peer-to-peer publish/subscribe system for events described in RDF. By 
avoiding the use of multiple indexes for the same publication, they manage to 
reduce storage space. Similarly, the authors of [3] provide an implementation 
with publish/subscribe capabilities in an RDF-based peer-to-peer system to 
manage digital resources. As for resource link maintenance, DSNotify [3] offers 
a change-detection framework to detect and fix broken links between resources 
in two datasets, while Semantic Pingback [?] proposes a notification system for 
the creation of new links between Web resources. Note that this approach is 
suitable for relatively static resources, i.e., RDF documents or 
RDFa-annotated Web pages. In contrast, SparqlPuSH [5] offers a real-time 
notification framework for data updates in an RDF store using a semantic 
PubSubHubbub-based protocol (PuSH). SparqlPuSH allows users to subscribe to 
updates of a subset of the content of an RDF store using SPARQL. However, 
notification and broadcasting are only available as RSS and Atom feeds. As 
regards target synchronization, RDFSync [5] performs update synchronization by 
merging the source and target graphs to obtain the updated target RDF graph. 
Alternatively, the authors of [5] designed an approach to replicate, modify, 
and write back parts of an RDF graph on devices with low computing power. 
However, this approach does not resolve conflicts arising from concurrent 
modifications of both the base graph and the partial replicas. In the field of 
object database management systems, a data-shipping client-server 
architecture, such as in [?], is used for data distribution. The aim is to 
optimize resource utilization at the client side, where the data objects from 
the server are cached for future use. In distributed databases, where data is 
replicated on different sites, lazy update protocols [5] disseminate updates to 
replicas to ensure consistency. These protocols guarantee serializable 
execution as well as high performance. 

6 Conclusion and Future Work 

In this paper we presented a novel approach for interest-based RDF update 
propagation that can consistently maintain a full or partial replication of 
large LOD datasets. We have demonstrated the validity of the approach through 
detailed formalizations and their application in a reference implementation of 
the iRap Framework. A thorough evaluation of the approach, using large-scale 
real-world data dumps and changesets regularly provided by a renowned LOD 
dataset, indicates that our method can significantly reduce both the size of 
the data updates required to keep a local dataset replica consistently up to 
date and the time such updates take. 

Future work will focus on extending the iRap Framework with a distributed 
publish/subscribe architecture as described in the related work (Section 5). 
The framework will also be improved from a usability point of view, including 
a user interface and an easier and more efficient initial generation of RDF 
slices. Finally, an extensive evaluation of the scalability and performance of 
the framework will be performed, and a benchmark dataset for future reference 
will be made available to the research community. 